1. Introduction

Classical  (centralized)  theories  of  decision  making  and  computation  deal 
with  the  situation  in  which  a  single  decision  maker  (man  or  machine)  possesses 
(or  collects)  all  available  information  related  to  a  certain  system  and  has  to 
perform  some  computations  and/or  make  a  decision  so  as  to  achieve  a  certain 
objective.  In  mathematical  terms,  the  decision  problem  is  usually  expressed  as  a 
problem  of  choosing  a  decision  function  that  transforms  elements  of  the  information 
space  into  elements  of  the  decision  space  so  as  to  minimize  a  cost  function.  From 
the  point  of  view  of  the  theory  of  computation,  we  are  faced  with  the  problem  of 
designing  a  serial  algorithm  which  actually  computes  the  desired  decision. 

Many real-world systems, however, such as power systems, communication networks,
large manufacturing systems, and public or business organizations, are too large for
the  classical  model  of  decision  making  to  be  applicable.  There  may  be  a  multitude 
of  decision  makers  (or  processors),  none  of  which  possesses  all  relevant  knowledge 
because  this  is  impractical,  inconvenient,  or  expensive  due  to  limitations  of  the 
system's  communication  channels,  memory,  or  computation  and  information  processing 
capabilities. 

In  other  cases  the  designer  may  deliberately  introduce  multiple  processors 
into  a  system  in  view  of  the  potential  significant  advantages  offered  by  distributed 
computation. For problems where processing speed is a major bottleneck, distributed
computing systems may offer increases in throughput that are either unattainable or
prohibitively expensive using a single processor. For problems where reliability
or survivability is a major concern, distributed systems can offer increased fault
tolerance or more graceful performance degradation in the face of various kinds of
equipment failures. Finally, as the cost of computation has decreased dramatically
relative to the cost of communication, it is now advantageous to trade off increased
computation for reduced communication. Thus, in database or sensor systems involving
geographically  separated  data  collection  points  it  may  be  advantageous  to  process 
data  locally  at  the  point  of  collection  and  send  condensed  summaries  to  other  points 
as  needed  rather  than  communicate  the  raw  data  to  a  single  processing  center. 

For  these  reasons,  we  will  be  interested  in  schemes  for  distributed  decision 
making  and  computation  in  which  a  set  of  processors  (or  decision  makers)  eventually 
compute  a  desired  solution  through  a  process  of  information  exchange.  It  is  possible 
to  formulate  mathematically  a  distributed  decision  problem  whereby  one  tries  to 
choose  an  "optimal"  distributed  scheme,  subject  to  certain  limitations.  For  example, 
we  may  impose  constraints  on  the  amount  of  information  that  may  be  transferred  and 
look for a scheme which results in the best achievable decisions, given these
constraints. Such problems have been formulated and studied in the decentralized control
context  [21,22],  as  well  as  in  the  computer  science  literature  [23,24].  However, 
in  practice  these  turn  out  to  be  very  difficult,  usually  intractable  problems  [25,26]. 
We therefore choose to focus on distributed algorithms with a prespecified structure
(rather than try to find an optimal structure): we assume that each processor chooses
an initial decision and iteratively improves this decision as more information is
obtained from the environment or other processors. By this we mean that the $i$th
processor updates from time to time his decision $x^i$ using some formula

$$x^i \leftarrow f^i(x^i, I^i) \tag{1.1}$$

where $I^i$ is the information available to the $i$th processor at the time of the
update. In general there are serious limitations to this approach, the most obvious
of which is that the function $f^i$ in (1.1) has to be chosen a priori on the basis
of ad hoc considerations. However, there are situations where the choice of
reasonable functions $f^i$ is not too difficult, and iterations such as (1.1) can
provide a practical approach to an otherwise very difficult problem. After all,
centralized counterparts of processes such as (1.1) are of basic importance in the
study of stability of dynamic systems, and deterministic and stochastic optimization
algorithms.

In most of the cases we consider, the information $I^i$ of processor $i$ contains some
past decisions of other processors. However, we allow the possibility that some
processors perform computations (using (1.1)) more often than they exchange information,
in which case the information $I^i$ may be outdated. This allows us to model situations
frequently encountered in large systems where it is difficult to maintain
synchronization between various parts of the decision making and information gathering processes.

There  are  a  number  of  characteristics  and  issues  relating  to  the  distributed 
iterative  process  (1.1)  that  either  do  not  arise  in  connection  with  its  centralized 
counterpart or else appear in milder form. First, there is a graph structure
characterizing the interprocessor flow of information. Second, there is an expanded
notion of the state of computation, characterized by the current results of computation
$x^i$ and the latest information $I^i$ available at the entire collection of processors $i$.
Finally, when (as we assume in this paper) there is no strict sequence according to
which computation and communication take place at the various processors, the state of
computation tends to evolve according to a point-to-set mapping, and possibly in a
probabilistic manner, since each state of computation may give rise to many other states
depending on which of the processors executes iteration (1.1) next and on
possibly random exogenous information made available at the processors during execution
of the algorithm.

From the point of view of applications, we can see several possible (broadly
defined)  areas.  We  discuss  below  some  of  them,  although  this  is  not  meant  to 
be  an  exhaustive  list. 

a)  Parallel  computing  systems,  possibly  designed  for  a  special  purpose,  e.g. 
for  solving  large  scale  mathematical  programming  problems  with  a  particular 
structure.  An  important  distinguishing  feature  of  such  systems  is  that  the 
machine  architecture  is  usually  under  the  control  of  the  designer.  As  mentioned 
above,  we  will  assume  a  prespecified  structure,  thereby  bypassing  issues  of 
architectural choice. However, the work surveyed in this paper can be useful
for assessing the effects of communication delays and of the lack of synchronization
in some parallel computing systems. Some of the early work on the subject [10], [11]
is motivated by such systems. For a discussion of related issues see [7].

b) Data Communication Networks. Real time data network operation lends itself
naturally to application of distributed algorithms. The structure needed for
distributed computation (geographically distributed processors connected by communication
links) is an inherent part of the system. Information such as link message flows,
origin to destination data rates, and link and node failures is collected at
geographically distributed points in the network. It is generally difficult to
implement centralized algorithms whereby a single node would collect all
information needed, make decisions, and transmit decisions back to the points of
interest. The amount of data processing required of the central node may be too
large. In addition, the links over which information is transmitted to and
from the central node are subject to failure, thereby compounding the difficulties.
For these reasons, in many networks (e.g. the ARPANET) algorithms such as routing,
flow control, and failure recovery are carried out in distributed fashion [1]-[5].
Since maintaining synchronization in a large data network generally poses
implementation difficulties, these algorithms are often operated asynchronously.

c)  Distributed  Sensor  Networks  and  Signal  Processing.  Suppose  that  a  set  of 
sensors  obtain  noisy  measurements  (or  a  sequence  of  measurements)  of  a  stochastic 
signal  and  then  exchange  messages  with  the  purpose  of  computing  a  final  estimate 
or  identifying  some  unknown  parameters.  We  are  then  interested  in  a  scheme  by 
which satisfactory estimates are produced without requiring that each sensor
communicate his detailed information to a central processor. Some approaches that
have  been  tried  in  this  context  may  be  found  in  [27,28,29,30]. 

d)  Large  Decentralized  Systems  and  Organizations.  There  has  been  much  interest, 
particularly  in  economics,  in  situations  in  which  a  set  of  rational  decision  makers 
make  decisions  and  then  update  them  on  the  basis  of  new  information.  Arrow  and 
Hurwicz  [31]  have  suggested  a  parallelism  between  the  operation  of  an  economic  market 
and distributed computation. In this context the study of distributed algorithms may be
viewed as an effort to model collective behavior. Similar models have been proposed
for biological systems [32]. Alternatively, finding good distributed algorithms
and studying their communication requirements may yield insights on good ways of
designing large organizations. It should be pointed out that there is an open
debate concerning the degree of rationality that may be assumed for human decision
makers. Given the cognitive limitations of humans, it is fair to say that only
relatively simple algorithms can be meaningful in such contexts. The algorithms
considered in this paper tend to be simple, particularly when compared with other
algorithms where decision makers attempt to process optimally the available information.

There are several broad methodological issues associated with iterative
distributed algorithms, such as correctness, computation or communication efficiency,
and robustness. In this paper we will focus on two issues that
generally relate to the question of validity of an algorithm.

a) Under what conditions is it possible to guarantee asymptotic convergence
of the iterates $x^i$ for all processors $i$, and asymptotic agreement between different
processors $i$ and $j$ [$(x^i - x^j) \to 0$]?

b)  How  much  synchronization  between  processor  computations  is  needed  in 
order  to  guarantee  asymptotic  convergence  or  agreement? 

Significant progress has been made recently towards understanding these
issues, and the main purpose of this paper is to survey this work. On the other
hand  little  is  known  at  present  regarding  issues  such  as  speed  of  convergence,  and 
assessment  of  the  value  of  communicated  information  in  a  distributed  context.  As 
a  result  we  will  not  touch  upon  these  topics  in  the  present  paper.  Moreover,  there 
are  certain  settings  (e.g.,  decentralized  control  of  dynamical  systems, dynamic 
routing  in  data  networks)  in  which  issues  of  asymptotic  convergence  and  agreement 
do  not  arise.  Consequently,  the  work  surveyed  here  is  not  of  direct  relevance  to 
such  situations. 

In  the  next  two  sections  we  formulate  a  model  of  distributed  asynchronous 
iterative  computation,  and  illustrate  its  relevance  by  means  of  a  variety  of  examples 
from optimization, parameter estimation, and communication networks. The model
bears similarity to models of chaotic relaxation and distributed asynchronous
fixed point computation [10]-[13] but is more general in two respects. First, we
allow two or more processors to update separately estimates of the same coordinate
of the decision vector and combine their individual estimates by taking convex
combinations, or otherwise. Second, we allow processors to receive possibly stochastic
measurements from the environment which may depend in nonlinear fashion on
estimates of other processors. These generalizations broaden a great deal the
range of applicability of the model over earlier formulations.

In Sections 4 and 5 we discuss two distinct approaches for analyzing algorithmic
convergence. The first approach is essentially a generalization of the
Lyapunov function method for proving convergence of centralized iterative processes.
The  second  approach  is  based  on  the  idea  that  if  the  processors  communicate  fast 
relative  to  the  speed  of  convergence  of  computation  then  their  solution  estimates 
will  be  close  to  the  path  of  a  certain  centralized  process.  By  analyzing  the 
convergence  of  this  latter  process  one  can  draw  inferences  about  the  convergence 
of  the  distributed  process.  In  Section  5  we  present  results  related  primarily  to 
deterministic  and  stochastic  descent  optimization  algorithms.  An  analysis  that 
parallels  Ljung's  ODE  approach  [37],  [38]  to  recursive  stochastic  algorithms  may  be 
found  in  [35]  and  in  a  forthcoming  publication.  In  Section  6  we  discuss  convergence 
and  agreement  results  for  a  special  class  of  distributed  processes  in  which  the 
update  of  each  processor,  at  any  given  time,  is  the  optimal  estimate  of  a  solution 
given  his  information,  in  the  sense  that  it  minimizes  the  conditional  expectation 
of  a  common  cost  function. 
2. A Distributed Iterative Computation Model

In our model we are given a set of feasible decisions $X$ and we are interested
in finding an element of a special subset $X^*$ called the solution set. We do not
specify $X^*$ further for the time being. An element of $X^*$ will be referred to as a
solution. Without loss of generality we index all events of interest (message
transmissions and receptions, obtaining measurements, performing computations) by
an integer time variable $t$. There is a finite collection of processors $i = 1, \ldots, n$,
each of which maintains an estimate $x^i(t) \in X$ of a solution and updates it once
in a while according to a scheme to be described shortly. The $i$th processor
receives also from time to time $m_i$ different types of measurements and maintains
the latest values $z^i_1, \ldots, z^i_{m_i}$ of these measurements. (That is, if no
measurement of type $j$ is received at time $t$, then $z^i_j(t+1) = z^i_j(t)$.) The measurement
$z^i_j$ is an element of a set $Z^i_j$. Each time a measurement $z^i_j$ of type $j$ is received
by processor $i$, the old value $z^i_j$ is replaced by the new value and the estimate $x^i$
is updated according to

$$x^i(t+1) = M_{ij}\bigl(x^i(t), z^i_1(t), \ldots, z^i_{m_i}(t)\bigr), \tag{2.1}$$

where $M_{ij}$ is a given function. Each node $i$ also updates from time to time the
estimate $x^i$ according to

$$x^i(t+1) = C_i\bigl(x^i(t), z^i_1(t), \ldots, z^i_{m_i}(t)\bigr), \tag{2.2}$$

where $C_i$ is a given function. Thus at each time $t$ each processor $i$ either
receives a new measurement of type $j$ and updates $x^i$ according to (2.1), or
updates $x^i$ according to (2.2), or remains idle, in which case $x^i(t+1) = x^i(t)$
and $z^i_j(t+1) = z^i_j(t)$ for all $j$. The sequence according to which a processor
executes (2.1) or (2.2) or remains idle is left unspecified, and indeed much of
the analysis in this paper is oriented towards the case where there is considerable
a priori uncertainty regarding this sequence. One of the advantages of this
approach is that difficult analytical problems arising due to consideration of
nonclassical information patterns [21] do not appear in our framework. Note that
neither mapping $M_{ij}$ nor $C_i$ involves a dependence on the time argument $t$. This is
appropriate since it would be too restrictive to assume that all processors have
access to a global clock that records the current time index $t$. On the other hand,
the mappings $M_{ij}$ and $C_i$ may include dependences on local clocks (or counters) that
record the number of times iterations (2.1) or (2.2) are executed at processor $i$. The
value of the local counter of processor $i$ may be artificially lumped as an additional
component into the estimate $x^i$ and incremented each time (2.1) or (2.2) is executed.
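The three kinds of events in the model (measurement update, self-update, idling) can be sketched as a small data structure. This is only an illustrative skeleton, not part of the paper: the class name `Processor`, its method names, and the choice to pass the whole measurement table to both mappings are assumptions made for the example. The two callables stand in for the mappings of (2.1) and (2.2), and the `counter` field plays the role of the local clock discussed above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Hashable


@dataclass
class Processor:
    """One processor i: an estimate x plus the latest measurement of each type j."""
    x: Any
    # Stand-in for M_ij in (2.1): (x, all latest measurements, type just received) -> new x
    measurement_update: Callable[[Any, Dict[Hashable, Any], Hashable], Any]
    # Stand-in for C_i in (2.2): (x, all latest measurements) -> new x
    computation_update: Callable[[Any, Dict[Hashable, Any]], Any]
    z: Dict[Hashable, Any] = field(default_factory=dict)
    counter: int = 0  # local clock: number of updates executed at this processor

    def receive(self, j: Hashable, value: Any) -> None:
        """Event (2.1): overwrite the old measurement of type j, then update x."""
        self.z[j] = value
        self.x = self.measurement_update(self.x, self.z, j)
        self.counter += 1

    def compute(self) -> None:
        """Event (2.2): improve x using only the information already held."""
        self.x = self.computation_update(self.x, self.z)
        self.counter += 1

    def idle(self) -> None:
        """No event: x and every stored measurement keep their values."""
```

No global time index appears anywhere in the class, which mirrors the remark that the mappings may depend on local counters but not on a global clock.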

Note that there is redundancy in introducing the update formula (2.2) in
addition to (2.1). We could view (2.2) as a special case of (2.1) corresponding to
an  update  in  response  to  a  "self-generated"  measurement  at  node  i.  Indeed  such  a 
formulation  may  be  appropriate  in  some  problems.  On  the  other  hand  there  is  often 
some  conceptual  value  in  separating  the  types  of  updates  at  a  processor  in  updates 
that  incorporate  new  exogenous  information  (cf.  (2.1)),  and  updates  that  utilize 
the  existing  information  to  improve  the  processor's  estimate  (cf.  (2.2)). 

The measurement $z^i_j(t)$, received by processor $i$ at time $t$, is related to the
processor estimates $x^1, x^2, \ldots, x^n$ according to an equation of the form

$$z^i_j(t) = \phi_{ij}\bigl(x^1(\tau^{i1}_j(t)),\, x^2(\tau^{i2}_j(t)),\, \ldots,\, x^n(\tau^{in}_j(t)),\, \omega\bigr), \tag{2.3}$$

where $\omega$ belongs to the sample space $\Omega$ corresponding to a probability space
$(\Omega, F, P)$.

We allow the presence of delays in equation (2.3) in the sense that the
estimates $x^1, \ldots, x^n$ may be the ones generated via (2.1) or (2.2) at the
corresponding processors at some times $\tau^{ik}_j(t) \le t$, prior to the time $t$ that
$z^i_j(t)$ was received at processor $i$. Furthermore, the delays may be different for
different processors. We place the following restriction on these delays, which essentially
says that successive measurements of the same type depend on successive processor
estimates.


Assumption 2.1: If $t > t'$, then

$$\tau^{ik}_j(t) \ge \tau^{ik}_j(t'), \quad \forall\, i, j, k.$$
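As a small illustration of what this restriction rules out, the following sketch checks a recorded sequence of time stamps for one fixed triple $(i, j, k)$. It assumes a reading of the assumption in which time stamps may repeat but never go backwards, together with the causality requirement that a time stamp cannot exceed the current time; the function name and the list encoding of the time stamps are hypothetical.

```python
def delays_are_admissible(tau):
    """tau[t] is the time stamp of the estimate reflected in the measurement held at time t.

    Returns True when every time stamp is causal (tau[t] <= t) and the
    sequence of time stamps never decreases as t increases.
    """
    causal = all(stamp <= t for t, stamp in enumerate(tau))
    ordered = all(earlier <= later for earlier, later in zip(tau, tau[1:]))
    return causal and ordered
```

For instance, `[0, 0, 1, 3, 3]` is admissible (the same measurement is simply kept until a newer one arrives), while `[0, 1, 0]` is not, because the third measurement would reflect an older estimate than the second.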

For the time being, the only other assumption regarding the timing and
sequencing of measurement reception and estimate generation is the following:

Assumption 2.2 (Continuing Update Assumption): For any $i$ and $j$ and any time $t$
there exists a time $t' > t$ at which a measurement $z^i_j$ of the form (2.3) will be
received at $i$ and the estimate $x^i$ will be updated according to (2.1). Also, for
any $i$ and time $t$ there exists a time $t'' > t$ at which the estimate $x^i$ will be updated
according to (2.2).

The  assumption  essentially  states  that  each  processor  will  continue  to 
receive  measurements  in  the  future  and  update  his  estimate  according  to  (2.1) 
and  (2.2).  Given  that  we  are  interested  in  asymptotic  results  there  isn't  much  we 
can  hope  to  prove  without  an  assumption  of  this  type.  In  order  to  formulate 
substantive  convergence  results  we  will  also  need  further  assumptions  on  the  nature 
of the mappings $M_{ij}$, $C_i$, and $\phi_{ij}$, and possibly on the relative timing of measurement
receptions, estimate updates, and delays in (2.3), and these will be introduced
later.  In  the  next  section  we  illustrate  the  model  and  its  potential  uses  by 
means  of  examples. 

It should be pointed out here that the above model is very broad and may
capture a large variety of different situations, provided that the measurements
$z^i_j$ are given appropriate interpretations. For example, the choice $z^i_j(t) = x^j(\tau^{ij}_j(t))$
corresponds to a situation where processor $i$ receives a message with the estimate
computed by processor $j$ at time $\tau^{ij}_j(t)$, and $t - \tau^{ij}_j(t)$ may be viewed as a
communication delay. In this case processors act also as sensors generating
measurements for other processors. In other situations, however, specialized sensors
may generate (possibly noisy and delayed) feedback to the processors regarding
estimates of other processors (cf. (2.3)). Examples of both of these situations
will be given in the next section.

3.  Examples 

An important special case of the model of the previous section is when the
feasible set $X$ is the Cartesian product of $n$ sets

$$X = X_1 \times X_2 \times \cdots \times X_n,$$

each processor $i$ is assigned the responsibility of updating the $i$th component
of the decision vector $x = (x_1, x_2, \ldots, x_n)$ via (2.2) while receiving from each
processor $j$ ($j \ne i$) the value of the $j$th component $x_j$. We refer to such distributed
processes as being specialized. The first five examples are of this type.

Example 1: (Shortest Path Computation)

Let $(N, A)$ be a directed graph with set of nodes $N = \{1, 2, \ldots, n\}$ and set of links
$A$. Let $N(i)$ denote the set of downstream neighbors of node $i$, i.e. the nodes $j$
such that $(i, j)$ is a link. Assume that each link $(i, j)$ is assigned a positive
scalar $a_{ij}$ referred to as its length. Assume also that there is a directed
path to node 1 from every other node. Let $x^i_i$ be the estimate of the shortest
distance from node $i$ to node 1 available at node $i$. Consider a distributed
algorithm whereby each node $i = 1, \ldots, n$ executes the iteration

$$x^i_i \leftarrow \min_{j \in N(i)} \{ a_{ij} + x^i_j \} \tag{3.1}$$

after receiving one or more estimates $x^j_j$ from its neighbors, while node 1 sets

$$x^1_1 = 0.$$

This algorithm, a distributed asynchronous implementation of Bellman's shortest
path algorithm, was implemented on the ARPANET in 1969 [14]. The estimate $x^i_i$
can be shown to converge to the unique shortest distance from node $i$ to node 1
provided the starting values $x^i_j$ are nonnegative [12]. The algorithm clearly is
a special case of the model of the previous section. Here the measurement
equation [cf. (2.3)] is

$$z^i_j = x^j_j, \quad \forall\, j \in N(i), \tag{3.2}$$

the measurement update equation [cf. (2.1)] replaces $x^i_j$ by $z^i_j$ and leaves all
other coordinates $x^i_m$, $m \ne j$, unchanged, while the corresponding update formula of
(2.2) can be easily constructed using (3.1).
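A toy simulation may help make the asynchronous operation concrete. The sketch below is not the ARPANET implementation; it is a minimal model under stated assumptions: events are drawn at random (the random schedule, the `seed`, the 50/50 split between message and computation events, and the function name are all choices made for this example), a message delivers the sender's current estimate, and received values may then sit unused and become outdated before the next computation.

```python
import random


def async_shortest_paths(arcs, n, steps=2000, seed=0):
    """Asynchronous iteration (3.1) on nodes 1..n with destination node 1.

    arcs maps a link (i, j) to its positive length a_ij.
    """
    rng = random.Random(seed)
    neighbors = {i: [j for (a, j) in arcs if a == i] for i in range(1, n + 1)}
    # own[i]: node i's current estimate of its distance to node 1
    # (nonnegative starting values; node 1 keeps the value 0 throughout)
    own = {i: 0.0 for i in range(1, n + 1)}
    # heard[i][j]: the last estimate of neighbor j that reached node i
    heard = {i: {j: own[j] for j in neighbors[i]} for i in own}
    for _ in range(steps):
        i = rng.randint(1, n)
        if rng.random() < 0.5 and neighbors[i]:
            # Measurement event, cf. (3.2): node i hears from one neighbor
            j = rng.choice(neighbors[i])
            heard[i][j] = own[j]
        elif i != 1:
            # Computation event, cf. (3.1), using possibly outdated values
            own[i] = min(arcs[i, j] + heard[i][j] for j in neighbors[i])
    return own
```

On a four-node graph with links `(2,1)`, `(3,2)`, `(3,1)`, `(4,3)` of lengths 1, 1, 5, 2, the estimates start at zero, increase monotonically, and settle at the shortest distances 0, 1, 2, 4 regardless of the order in which the events happen to occur.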

Example 2: (Fixed point calculations)

The preceding example is a special case of a distributed dynamic programming
algorithm (see [12]) which is itself a special case of a distributed fixed point
algorithm. Suppose we are interested in computing a fixed point of a mapping
$F: X \to X$. We construct a distributed fixed point algorithm that is a special case
of the model of the previous section as follows:

Let $X$ be a Cartesian product of the form $X = X_1 \times X_2 \times \cdots \times X_n$ and let us write
accordingly $x = (x_1, x_2, \ldots, x_n)$ and $F(x) = (F_1(x), F_2(x), \ldots, F_n(x))$ where $F_i: X \to X_i$.
Let $x^i = (x^i_1, \ldots, x^i_n)$ be the estimate of $x$ generated at the $i$th processor. Processor
$i$ executes the iteration

$$x^i_j \leftarrow \begin{cases} x^i_j & \text{if } i \ne j \\ F_i(x^i) & \text{if } i = j \end{cases} \tag{3.3}$$

(this corresponds to the mapping $C_i$ of (2.2)), and transmits from time to time
$x^i_i$ to the other processors. Thus the measurements $z^i_j$ are given by [cf. (2.3)]

$$z^i_j = x^j_j, \quad i \ne j, \tag{3.4}$$

and the $(i,j)$th measurement update equation [cf. (2.1)] is given by

$$x^i_m \leftarrow \begin{cases} x^i_m & \text{if } m \ne j \\ z^i_j & \text{if } m = j. \end{cases} \tag{3.5}$$

Conditions under which the estimate $x^i$ converges to a fixed point of $F$ are
given in [13] (see also Section 4).
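The bookkeeping of (3.3)-(3.5) can be sketched as follows. Each processor keeps its own full copy of the vector, recomputes only its own coordinate, and overwrites other coordinates when messages arrive. The random event schedule, the function name, and the zero-based indexing are assumptions of this sketch; convergence is not guaranteed for an arbitrary $F$, so the usage example below picks a mapping that is a contraction in the maximum norm, which is one standard sufficient condition.

```python
import random


def async_fixed_point(F_components, x0, steps=5000, seed=0):
    """Asynchronous fixed point iteration with one private copy of x per processor.

    F_components[i] maps a full vector (a list) to the i-th component of F.
    Returns est, where est[i] is processor i's estimate of the whole vector.
    """
    rng = random.Random(seed)
    n = len(F_components)
    est = [list(x0) for _ in range(n)]
    for _ in range(steps):
        i = rng.randrange(n)
        if rng.random() < 0.5:
            # Computation event, cf. (3.3): processor i updates its own coordinate
            est[i][i] = F_components[i](est[i])
        else:
            # Measurement event, cf. (3.4)-(3.5): processor i receives coordinate j
            # from processor j (when j == i the assignment changes nothing)
            j = rng.randrange(n)
            est[i][j] = est[j][j]
    return est
```

With `F_components = [lambda x: 0.5 * x[1] + 1.0, lambda x: 0.25 * x[0] + 1.0]` and `x0 = [0.0, 0.0]`, both processors' copies approach the fixed point $(12/7, 10/7)$, even though each processor usually works with a stale value of the other's coordinate.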

Example 3: (Distributed deterministic gradient algorithm)

This example is a special case of the preceding one whereby $X = \mathbb{R}^n$,
$X_i = \mathbb{R}$, and $F$ is of the form

$$F(x) = x - \alpha \nabla f(x) \tag{3.6}$$

where $\nabla f$ is the gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$, and $\alpha$ is a positive scalar
stepsize. Iteration (3.3) can then be written as

$$x^i_j \leftarrow \begin{cases} x^i_j & \text{if } i \ne j \\ x^i_i - \alpha \dfrac{\partial f(x^i)}{\partial x_i} & \text{if } i = j. \end{cases} \tag{3.7}$$

A variation of this example is obtained if we assume that, instead of
each processor $i$ transmitting directly his current value of the coordinate $x^i_i$
to the other processors, there is a measurement device that transmits the
current value of the partial derivative $\partial f / \partial x_i$ to the $i$th processor. In this
case there is only one type of measurement for each processor $i$ [cf. (2.3)] and
it is given by

$$z^i = \frac{\partial f(x^1_1, \ldots, x^n_n)}{\partial x_i}.$$

While the equation above assumes no noise in the measurement of each partial
derivative, one could also consider the situation where this measurement is
corrupted by additive or multiplicative noise, thereby obtaining a model of a
distributed stochastic gradient method. Many other descent algorithms admit a similar
distributed version.
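Iteration (3.7) can be tried out on a small quadratic. The sketch below is illustrative only: the random schedule, the function name, and the test function are assumptions, and the stepsize and the weak coupling between coordinates were chosen so that the gradient map is a maximum-norm contraction; for other functions or stepsizes the asynchronous iteration need not converge.

```python
import random


def async_gradient(partials, x0, alpha, steps=20000, seed=0):
    """Asynchronous coordinate-wise gradient iteration in the spirit of (3.7).

    partials[i] maps a full vector (a list) to the partial derivative df/dx_i.
    Returns est, where est[i] is processor i's private copy of the vector.
    """
    rng = random.Random(seed)
    n = len(x0)
    est = [list(x0) for _ in range(n)]
    for _ in range(steps):
        i = rng.randrange(n)
        if rng.random() < 0.5:
            # Case i = j of (3.7): gradient step on processor i's own coordinate,
            # evaluated at its own (possibly outdated) copy of the other coordinates
            est[i][i] -= alpha * partials[i](est[i])
        else:
            # Processor i receives the current coordinate of processor j
            j = rng.randrange(n)
            est[i][j] = est[j][j]
    return est


# Example function f(x) = x0^2 + x1^2 + x0*x1 - 3*x0 - 3*x1, minimized at (1, 1).
example_partials = [
    lambda x: 2.0 * x[0] + x[1] - 3.0,
    lambda x: x[0] + 2.0 * x[1] - 3.0,
]
```

Calling `async_gradient(example_partials, [0.0, 0.0], alpha=0.25)` drives every processor's copy towards the minimizer $(1, 1)$.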




Example 4: (An Organizational Model)

This example is a variation of the previous one, but may be also viewed as a
model of collective decision making in a large organization. Let
$X = X_1 \times X_2 \times \cdots \times X_n$ be the feasible set, where $X_i$ is a Euclidean space, and let
$f: X \to [0, \infty)$ be a cost function of the form $f(x) = \sum_{i=1}^{n} f^i(x)$. We interpret $f^i$
as the cost facing the $i$-th division of an organization. This division is under
the authority of decision maker (processor) $i$, who updates the $i$-th component
$x_i$ of the decision vector $x$. We allow the cost $f^i$ to depend on the decisions
$x_j$ of the remaining decision makers, but we assume that this dependence is weak.
That is, let

$$K^i_{jm} = \sup_{x \in X} \left| \frac{\partial^2 f^i(x)}{\partial x_j \, \partial x_m} \right|$$

and we are interested in the case where $K^i_{jm}$ is small (unless $j = m = i$). Decision maker $i$
receives measurements $z^i_j$, $j = 1, \ldots, n$, of the form

$$z^i_j(t) = \frac{\partial f^j}{\partial x_i}\bigl(x_1(\tau^{i1}_j(t)),\, x_2(\tau^{i2}_j(t)),\, \ldots,\, x_n(\tau^{in}_j(t))\bigr) \tag{3.8}$$

where $\tau^{ik}_j(t) \le t$ [cf. (2.3)]. Once in a while, he also updates his decision
according to

$$x_i(t+1) = x_i(t) - \alpha^i \sum_{j=1}^{n} z^i_j(t). \tag{3.9}$$

If we assume that

$$\tau^{im}_j(t) \le \tau^{ij}_j(t), \quad \forall\, i, j, m, t,$$

the above algorithm admits the following interpretation: each decision maker
$m$, at time $\tau^{im}_j(t)$, sends a message $x_m(\tau^{im}_j(t))$ to inform decision maker $j$ of his
decision. These messages are the last such messages received by decision maker $j$
no later than $\tau^{ij}_j(t)$. Then, decision maker $j$ (who is assumed to be knowledgeable
about $f^j$) computes $z^i_j$ according to (3.8) and sends it to decision maker $i$; the
latter message is the last such message received by decision maker $i$ no later than
$t$, and is being used, at time $t$, by decision maker $i$ to update his decision
according to (3.9). On an abstract level, each decision maker $j$ is being informed
about the decision of the others and replies by saying how he is affected by
their decisions; however, this may be done in an asynchronous and very irregular
manner.

Example 5: (Distributed optimal routing in data networks)

A standard model of optimal routing in data networks (see e.g. the survey
[6]) involves the multicommodity flow problem

$$\begin{aligned}
\text{minimize} \quad & \sum_{a \in A} D_a(F_a) \\
\text{subject to} \quad & F_a = \sum_{w \in W} \sum_{\substack{p \in P_w \\ a \in p}} x_p, \quad \forall\, a \in A, \\
& \sum_{p \in P_w} x_p = r_w, \quad \forall\, w \in W, \\
& x_p \ge 0, \quad \forall\, w \in W,\ p \in P_w.
\end{aligned}$$

Here $A$ is the set of directed links in a data network, $F_a$ is the
communication rate (say in bits/sec) on link $a \in A$, $W$ is a set of origin-destination
(OD) pairs, $P_w$ is a given set of directed paths joining the origin and the
destination of OD pair $w$, $x_p$, $p \in P_w$, is the communication rate on path $p$ of OD
pair $w$, $r_w$ is a given required communication rate of OD pair $w$, and $D_a$ is a
monotonically increasing differentiable convex function for each $a \in A$. The
objective here is to distribute the required rates $r_w$ among the available paths
in $P_w$ so as to minimize a measure of average delay per message as expressed by
$\sum_{a \in A} D_a(F_a)$.

To determine the path rates $x_p$, $p \in P_w$, it is convenient to use a distributed
algorithm of the gradient projection type (see [6], [8]) whereby each origin iterates
on its own path rates asynchronously and independently of other origins. This type of
iteration requires knowledge of the first partial derivatives $D_a'(F_a)$ for each link,
evaluated at the current link rates $F_a$. A practical scheme similar to the one
currently adopted on the ARPANET [9] is for each link $a \in A$ to broadcast to all the
nodes the current value of either $F_a$ or $D_a'(F_a)$. This information is then incorporated
in the gradient projection iteration of the origin nodes. In this scheme each origin
node can be viewed as a processor and $F_a$ or $D_a'(F_a)$ plays the role of a measurement
which depends on the solution estimates of all processors [cf. (2.3)].
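To see what an origin node does with the broadcast information, note that by the chain rule the partial derivative of the total cost with respect to a path rate $x_p$ is the sum of $D_a'(F_a)$ over the links $a$ on path $p$. The sketch below computes the link rates and these path derivative sums. The particular delay function $D_a(F) = F / (C_a - F)$ with link capacity $C_a$ is an assumption made only for this example (the text requires just a monotonically increasing differentiable convex $D_a$), and the function and parameter names are hypothetical. The projection step of the gradient projection method is not shown.

```python
def link_flows(paths, x):
    """F_a = sum of the path rates x_p over all paths p that traverse link a.

    paths maps a path name to the tuple of links it uses; x maps a path name to its rate.
    """
    flows = {}
    for p, links in paths.items():
        for a in links:
            flows[a] = flows.get(a, 0.0) + x[p]
    return flows


def path_derivative_lengths(paths, x, capacity):
    """Sum of D_a'(F_a) along each path, for the assumed D_a(F) = F / (C_a - F)."""
    flows = link_flows(paths, x)
    # D_a'(F) = C_a / (C_a - F)^2 for the assumed delay function (requires F < C_a)
    d_prime = {a: capacity[a] / (capacity[a] - flows[a]) ** 2 for a in flows}
    return {p: sum(d_prime[a] for a in links) for p, links in paths.items()}
```

For two paths `p1 = ("a", "c")` and `p2 = ("b", "c")` carrying rates 1 and 2 over links of capacity 4, the shared link `c` carries rate 3 and dominates both sums; an origin would shift rate towards whichever of its paths has the smaller sum.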

The  direct  opposite  of  a  specialized  process,  in  terms  of  division  of  labor 
between  processors,  is  a  totally  overlapping  process. 


Example  6:  (Total  Overlap) 

Let the feasible set $X$ be a Euclidean space. Each processor $i$ receives
measurements $z_j^i$ ($j \ne i$) which are the values of the estimates $x^j$ of other
processors; that is,

$$ z_j^i = x^j, \quad i \ne j. $$

Whenever such a measurement is received, processor $i$ updates his estimate by
taking a convex combination:

$$ x^i \leftarrow M_{ij}(x^i, z_j^i) = \beta_{ij} x^i + (1 - \beta_{ij}) z_j^i, $$    (3.10)

where $0 < \beta_{ij} < 1$. Also processor $i$ receives his own information $z_i^i$, generated
according to

$$ z_i^i = \phi_{ii}(x^i, \omega), $$

and updates $x^i$ according to

$$ x^i(t+1) = M_{ii}(x^i(t), z_i^i(t)) = x^i(t) - \alpha^i z_i^i(t), $$    (3.11)

where $\alpha^i$ is a positive scalar stepsize.¹ Such an algorithm is of interest if the
objective is to minimize a cost function $f: X \to R$, and $z_i^i(t)$ is in some sense a
descent direction with respect to $f$. In a deterministic setting, such a scheme
could be redundant, as some processors would be close to replicating the computation
of others. In a stochastic setting, however (e.g. if

$$ z_i^i(t) = \frac{\partial f}{\partial x}(x^i(t)) + w^i(t), $$

where $w^i(t)$ is zero-mean white noise) the combining process is effectively
averaging out the effects of the noise and may improve convergence.
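
The averaging effect described in Example 6 can be illustrated with a small simulation. The
sketch below is only an illustration under assumptions that are not part of the original
text: a scalar cost $f(x) = (x-1)^2/2$, unit-variance Gaussian noise, a stepsize $1/t_i$
driven by a local counter, and a randomly chosen pair of processors for each exchange. The
function name `run_total_overlap` and all parameter names are hypothetical.

```python
import random


def run_total_overlap(n_proc=4, steps=2000, combine_every=5, beta=0.5, seed=0):
    """Noisy gradient steps (3.11) mixed with convex combinations (3.10).

    Assumed cost: f(x) = 0.5 * (x - 1)^2, so the minimizer is x = 1.
    """
    rng = random.Random(seed)
    x = [0.0] * n_proc          # estimates x^i, all started at 0
    counts = [0] * n_proc       # local counters used to shrink the stepsize
    for t in range(1, steps + 1):
        # Own update (3.11): x^i <- x^i - alpha^i * z_i^i, with z = gradient + noise.
        for i in range(n_proc):
            counts[i] += 1
            alpha = 1.0 / counts[i]
            z = (x[i] - 1.0) + rng.gauss(0.0, 1.0)
            x[i] -= alpha * z
        # Combining step (3.10): one processor receives another processor's estimate.
        if t % combine_every == 0:
            i = rng.randrange(n_proc)
            j = rng.choice([k for k in range(n_proc) if k != i])
            x[i] = beta * x[i] + (1.0 - beta) * x[j]
    return x


estimates = run_total_overlap()
```

With these assumed settings every entry of `estimates` should end up near 1, since each
estimate is a convex combination of running averages of the noisy gradient information.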


Example  7:  (System  Identification) 

Consider two moving average processes $y^1(t)$, $y^2(t)$ generated according to

$$ y^i(t) = A(q) u(t) + w^i(t), \quad i = 1, 2, $$


¹ The stepsize $\alpha^i$ could be constant as in deterministic gradient methods. However, in
other cases (such as stochastic gradient methods with additive noise) it is essential
that $\alpha^i$ is time varying and tends to zero. This, strictly speaking, violates the
assumption that the mapping does not depend on the time $t$. However it is possible
to circumvent this by introducing (as an additional component of $x^i$) a local counter
at each processor $i$ that keeps track of the number of times iteration (3.10) or (3.11)
is executed at processor $i$. The stepsize $\alpha^i$ could be made dependent on the value of
this local counter (see the discussion following (2.1) and (2.2) in Section 2).


where $A(\cdot)$ is a polynomial, to be identified, $q$ is the unit delay operator and
$w^i(t)$, $i = 1, 2$, are white, zero-mean processes, possibly correlated with each other.
Let there be two processors ($n = 2$); processor $i$ measures $y^i(t)$ and both measure
$u(t)$ at each time $t$. Each processor $i$ updates his estimate $x^i$ of the coefficients
of $A$ according to any of the standard system identification algorithms (e.g. the
LMS or RLS algorithm). Under the usual identifiability conditions [33] each
processor would be able to identify $A(\cdot)$ by himself. However, convergence should
be faster if once in a while one processor gets (possibly delayed) measurements
of the estimates of the other processor and combines them by taking a convex
combination. Clearly, this is a special case of Example 6.
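
A minimal sketch of this two-processor scheme is given next. It assumes a two-tap moving
average polynomial, white Gaussian input and noise, the plain LMS update, and an exchange of
estimates every 50 samples; none of these choices comes from the original text, and
`distributed_lms` and its parameters are hypothetical names.

```python
import random


def distributed_lms(a_true=(1.0, -0.5), steps=5000, mu=0.02,
                    exchange_every=50, beta=0.5, seed=1):
    """Two processors identify the same polynomial A(q) from y^i = A(q)u + w^i."""
    rng = random.Random(seed)
    m = len(a_true)
    est = [[0.0] * m, [0.0] * m]     # x^1 and x^2: coefficient estimates
    u_hist = [0.0] * m               # u(t), u(t-1), ...
    for t in range(1, steps + 1):
        u_hist = [rng.gauss(0.0, 1.0)] + u_hist[:-1]
        clean = sum(a * u for a, u in zip(a_true, u_hist))
        for i in range(2):
            y = clean + rng.gauss(0.0, 0.1)           # private noisy output y^i(t)
            err = y - sum(c * u for c, u in zip(est[i], u_hist))
            est[i] = [c + mu * err * u for c, u in zip(est[i], u_hist)]  # LMS step
        if t % exchange_every == 0:
            # Each processor takes a convex combination with the other's estimate.
            old0 = est[0][:]
            est[0] = [beta * a + (1.0 - beta) * b for a, b in zip(est[0], est[1])]
            est[1] = [beta * b + (1.0 - beta) * a for a, b in zip(old0, est[1])]
    return est


coefficients = distributed_lms()
```

Both rows of `coefficients` should be close to the assumed true coefficients `(1.0, -0.5)`.
The exchange step is exactly the convex combination (3.10) applied to the coefficient vectors.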

A more complex situation arises if we have two ARMAX processes $y^1$, $y^2$, driven
by a common colored noise:

$$ A^i(q) y^i(t) = B^i(q) u^i(t) + w(t), \quad i = 1, 2, $$

$$ w(t) = C(q) v(t), $$

where $v(t)$ is white and $A^i$, $B^i$, $C$ are polynomials in the delay operator $q$.
Assuming that each processor $i$ observes $y^i$ and $u^i$, he may under certain conditions
[34] identify $A^i$, $B^i$. In doing this he must, however, identify the common noise
source $C$ as well. So we may envisage a scheme whereby processors use a standard
algorithm to identify $A^i$, $B^i$, $C$ and once in a while exchange messages with their
estimates of the coefficients of $C$; these estimates are then combined by taking
a convex combination.

This latter example falls in between the extreme cases of specialization and
total overlap: there is specialization concerning the coefficients of $A^i$, $B^i$ and
overlap concerning the coefficients of $C$.


4.  Convergence  of  Contracting  Processes

In  our  effort  to  develop  a  general  convergence  result  for  the  distributed 
algorithmic  model  of  Section  2  we  draw  motivation  from  existing  convergence  theories 
for  (centralized)  iterative  algorithms.  There  are  several  theories  of  this  type 
(Zangwill  [15],  Luenberger  [16],  Ortega  and  Rheinboldt  [17] — the  most  general  are 
due  to  Poljak  [18]  and  Polak  [19]).  Most  of  these  theories  have  their  origin  in 
Lyapunov's  stability  theory  for  differential  and  difference  equations.  The  main 
idea  is  to  consider  a  generalized  distance  function  (or  Lyapunov  function)  of  the 
typical  iterate  to  the  solution  set.  In  optimization  methods  the  objective  function 
is  often  suitable  for  this  purpose  while  in  equation  solving  methods  a  norm  of  the 
difference  between  the  current  iterate  and  the  solution  is  usually  employed.  The 
idea  is  typically  to  show  that  at  each  iteration  the  value  of  the  distance  function 
is  reduced  and  reaches  its  minimum  value  in  the  limit. 

The  result  of  this  section  is  based  on  a  similar  idea.  However  instead  of 
working  with  a  generalized  distance  function  we  prefer  to  work  (essentially)  with 
the  level  sets  of  such  a  function;  and  instead  of  working  with  a  single  processor 
iterate  (as  in  centralized  processes)  we  work  with  what  may  be  viewed  as  a  state  of 
computation  of  the  distributed  process  which  includes  all  current  processor  iterates 
and  all  latest  information  available  at  the  processors. 

The  subsequent  result  is  reminiscent  of  convergence  results  for  successive 
approximation  methods  associated  with  contraction  mappings.  For  this  reason  we 
refer  to  processes  satisfying  the  following  assumption  as  contracting  processes. 

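In what follows in this section we assume that the feasible set $X$ in the model of
Section 2 is a topological space so we can talk about convergence of sequences in $X$.

Assumption 3.1: There exists a sequence of sets $X(k)$ with the following properties:

a) $X^* \subset X(k+1) \subset X(k) \subset \cdots \subset X$.

b) If $\{x_k\}$ is a sequence in $X$ such that $x_k \in X(k)$ for all $k$, then $\{x_k\}$ converges
to a solution.

(Note: If the notion of sequence convergence to a subset is defined on $X$, one may
replace convergence of $\{x_k\}$ to a solution with convergence to the solution set $X^*$.)

c) For all $i$, $j$ and $k$ denote:

$$ Z_j^i(k) = \{\phi_{ij}(x^1, \ldots, x^n, \omega) \mid x^1 \in X(k), \ldots, x^n \in X(k), \ \omega \in \Omega\}, $$    (4.1)

$$ \bar{X}^i(k) = \{M_{ii}(x^i, z_1^i, \ldots, z_{m_i}^i) \mid x^i \in X(k), \ z_1^i \in Z_1^i(k), \ldots, z_{m_i}^i \in Z_{m_i}^i(k)\}, $$    (4.2)

$$ \bar{Z}_j^i(k) = \{\phi_{ij}(x^1, \ldots, x^n, \omega) \mid x^1 \in \bar{X}^1(k), \ldots, x^n \in \bar{X}^n(k), \ \omega \in \Omega\}. $$    (4.3)

The sets $X(k)$ and the mappings $\phi_{ij}$, $M_{ii}$ and $M_{ij}$ are such that for all $i$, $j$ and $k$

$$ \bar{X}^i(k) \subset X(k), $$    (4.4)

$$ M_{ij}(x^i, z_1^i, \ldots, z_{m_i}^i) \in X(k), \quad \forall x^i \in X(k), \ z_1^i \in Z_1^i(k), \ldots, z_{m_i}^i \in Z_{m_i}^i(k), $$    (4.5)

$$ M_{ij}(x^i, z_1^i, \ldots, z_{m_i}^i) \in \bar{X}^i(k), \quad \forall x^i \in \bar{X}^i(k), \ z_1^i \in Z_1^i(k), \ldots, z_{m_i}^i \in Z_{m_i}^i(k), $$    (4.6)

$$ M_{ii}(x^i, z_1^i, \ldots, z_{m_i}^i) \in X(k+1), \quad \forall x^i \in \bar{X}^i(k), \ z_1^i \in \bar{Z}_1^i(k), \ldots, z_{m_i}^i \in \bar{Z}_{m_i}^i(k). $$    (4.7)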

Assumption 3.1 is a generalized version of a similar assumption in reference
[13]. Broad classes of deterministic specialized processes satisfying the assumption
are given in that reference. The main idea is that membership in the set $X(k)$ is
representative in some sense of the proximity of a processor estimate to a solution.
By part b), if we can show that a processor estimate successively moves from $X(0)$
to $X(1)$, then to $X(2)$ and so on, then convergence to a solution is guaranteed. Part
c) assures us that once all processor estimates enter the set $X(k)$ then they remain in
the set $X(k)$ [cf. (4.4), (4.5)] and (assuming all processors keep on computing and
receiving measurements) eventually enter the set $X(k+1)$ [cf. (4.6), (4.7)]. In view
of these remarks the proof of the following result is rather easy. Note that the
assumption does not differentiate the effects of two different members of the
probability space [cf. part c)] so it applies to situations where the process is
either deterministic ($\Omega$ consists of a single element), or else stochastic variations
are not sufficiently pronounced to affect the membership relations in part c).


Proposition 3.1: Let Assumptions 2.1, 2.2, 3.1 hold and assume that all initial
processor estimates $x^i$, $i = 1, \ldots, n$, belong to $X(0)$, while all initial measurements
$z_j^i$ available at the processors belong to the corresponding sets $Z_j^i(0)$. Then each
of the sequences $\{x^i(t)\}$ converges almost surely to a solution as $t \to \infty$.

The  proof  will  not  be  given  since  it  is  very  similar  to  the  one  given  in  [13]. 
Note  that  the  proposition  does  not  guarantee  asymptotic  agreement  of  the  processor 
estimates  but  in  situations  where  Assumption  3.1  is  satisfied  one  can  typically 
also  show  agreement. 


Example 2 (continued): As an illustration consider the specialized process for
computing a fixed point of a mapping $F$ in example 2. There $X$ is a Cartesian
product $X_1 \times X_2 \times \cdots \times X_n$, and each processor $i$ is responsible for updating the
ith "coordinate" $x_i$ of $x = (x_1, x_2, \ldots, x_n)$ while relying on essentially direct
communications from other processors to obtain estimates of the other coordinates.
Suppose that each set $X_i$ is a Banach space with norm $\|\cdot\|_i$ and $X$ is endowed with
the sup norm

$$ \|x\| = \max\{\|x_1\|_1, \ldots, \|x_n\|_n\}, \quad \forall x \in X. $$    (4.8)

Assume further that $F$ is a contraction mapping with respect to this norm,
i.e., for some $\alpha \in (0,1)$,

$$ \|F(x) - F(y)\| \le \alpha \|x - y\|, \quad \forall x, y \in X. $$    (4.9)

Then the solution set consists of the unique fixed point $x^*$ of $F$. For some
positive constant $B$ let us consider the sequence of sets

$$ X(k) = \{x \in X \mid \|x - x^*\| \le B\alpha^k\}, \quad k = 0, 1, \ldots $$

The sets defined by (4.1)-(4.3) are then given by

$$ Z_j^i(k) = \{x_j \in X_j \mid \|x_j - x_j^*\|_j \le B\alpha^k\}, $$

$$ \bar{X}^i(k) = \{x \in X(k) \mid \|x_i - x_i^*\|_i \le B\alpha^{k+1}\}, $$

$$ \bar{Z}_j^i(k) = \{x_j \in X_j \mid \|x_j - x_j^*\|_j \le B\alpha^{k+1}\}. $$

It is straightforward to show that the sequence $\{X(k)\}$ satisfies Assumption 3.1.
Further illustrations related to this example are given in [13]. Note however that
the use of the sup norm (4.8) is essential for the verification of Assumption 3.1.
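
The behavior guaranteed for this example can be observed on a small instance. The sketch
below assumes a linear mapping $F(x) = Ax + b$ whose matrix has maximum absolute row sum 0.7,
so that (4.9) holds with $\alpha = 0.7$ in the sup norm; processors skip updates at random and
read coordinates that are up to `max_delay` steps old. The matrix, the update schedule and
all names (`async_fixed_point`, `residual`, `A_EXAMPLE`, `B_EXAMPLE`) are illustrative
assumptions rather than part of the original model.

```python
import random

# Assumed test mapping F(x) = A x + b; every row of |A| sums to at most 0.7.
A_EXAMPLE = [[0.0, 0.4, 0.3],
             [0.2, 0.0, 0.5],
             [0.3, 0.3, 0.0]]
B_EXAMPLE = [1.0, 2.0, 3.0]


def async_fixed_point(A, b, sweeps=2000, max_delay=5, seed=0):
    """Specialized asynchronous iteration: processor i updates coordinate i only."""
    rng = random.Random(seed)
    n = len(b)
    history = [[0.0] * n]        # history[t][j]: coordinate j held by its owner at time t
    for t in range(sweeps):
        current = history[-1][:]
        for i in range(n):
            if rng.random() < 0.5:       # processor i may skip this time step
                continue
            view = []
            for j in range(n):
                # Other coordinates are seen with a random, bounded delay.
                d = 0 if j == i else rng.randint(0, min(max_delay, t))
                view.append(history[t - d][j])
            current[i] = sum(A[i][j] * view[j] for j in range(n)) + b[i]
        history.append(current)
    return history[-1]


def residual(A, b, x):
    """Sup-norm distance between x and F(x)."""
    n = len(b)
    return max(abs(sum(A[i][j] * x[j] for j in range(n)) + b[i] - x[i])
               for i in range(n))


x_final = async_fixed_point(A_EXAMPLE, B_EXAMPLE)
```

Under these assumptions `residual(A_EXAMPLE, B_EXAMPLE, x_final)` should be negligible, which
is consistent with the estimates moving through the nested sets $X(k)$.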

Similarly Assumption 3.1 can be verified in the preceding example if the contraction
assumption (4.9) is replaced by a monotonicity assumption (see [13]). This monotonicity
assumption is satisfied by most of the dynamic programming problems of interest
including the shortest path problem of example 1 (see also [12]). An important exception
is the infinite horizon average cost Markovian decision problem (see [12], p. 616).

An important special case for which the contraction mapping assumption
(4.9) is satisfied arises when $X = R^n$ and $x_1, x_2, \ldots, x_n$ are the coordinates of $x$.
Suppose that $F$ satisfies

$$ |F(x) - F(y)| \le P|x - y|, \quad \forall x, y \in R^n, $$    (4.10)

where $P$ is an $n \times n$ matrix with nonnegative elements and spectral radius strictly
less than unity, and for any $z = (z_1, z_2, \ldots, z_n)$ we denote by $|z|$ the column vector
with coordinates $|z_1|, |z_2|, \ldots, |z_n|$. Then $F$ is called a P-contraction mapping.
Fixed point problems involving such mappings arise in dynamic programming ([20],
p. 374), and solution of systems of nonlinear equations ([17], Section 13.1). It
can be shown ([11], p. 231) that if $F$ is a P-contraction then it is a contraction
mapping with respect to some norm of the form (4.8). Therefore Proposition 3.1
applies.
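
One standard way to exhibit such a norm, sketched here as an illustration and not taken from
[11], is to compute a positive vector $w$ with $Pw \le \alpha w$ for some $\alpha < 1$; then
$\|x\|_w = \max_i |x_i| / w_i$ is of the form (4.8) and (4.10) implies
$\|F(x) - F(y)\|_w \le \alpha \|x - y\|_w$. The code uses $w = (I - P)^{-1} \mathbf{1}$, obtained
by the iteration $w \leftarrow \mathbf{1} + Pw$, which converges when the spectral radius of
$P$ is below one. The function name and the sample matrix are hypothetical.

```python
def weights_for_p_contraction(P, iters=500):
    """Return (w, alpha) with w > 0 and P w <= alpha * w, assuming rho(P) < 1."""
    n = len(P)
    w = [1.0] * n
    for _ in range(iters):
        # Neumann-series iteration w <- 1 + P w, whose limit is (I - P)^{-1} 1.
        w = [1.0 + sum(P[i][j] * w[j] for j in range(n)) for i in range(n)]
    Pw = [sum(P[i][j] * w[j] for j in range(n)) for i in range(n)]
    alpha = max(Pw[i] / w[i] for i in range(n))
    return w, alpha


# Assumed sample matrix: one row sum is 2.5, yet the spectral radius is about 0.82.
P_EXAMPLE = [[0.5, 2.0],
             [0.05, 0.5]]
w_example, alpha_example = weights_for_p_contraction(P_EXAMPLE)
```

For this sample matrix the unweighted sup norm does not give a contraction, while the
weighted one does, since at the limit $Pw = w - \mathbf{1}$ and hence each ratio
$(Pw)_i / w_i = 1 - 1/w_i$ is below one.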

We  finally  note  that  it  is  possible  to  use  Proposition  3.1  to  show  convergence 
of  similar  fixed  point  distributed  processes  involving  partial  or  total  overlaps 
between  the  processors  (compare  with  example  6) . 

Example 3 (continued): Consider the special case of the deterministic gradient
algorithm of example 3 corresponding to the mapping

$$ F(x) = x - \alpha \nabla f(x). $$    (4.11)

Assume that $f: R^n \to R$ is a twice continuously differentiable convex function with
Hessian matrix $\nabla^2 f(x)$ which is positive definite for all $x$. Assume also that there
exists a unique minimizing point $x^*$ of $f$ over $R^n$. Consider the matrix

$$ H^* = \begin{pmatrix}
\dfrac{\partial^2 f}{(\partial x_1)^2} & -\left|\dfrac{\partial^2 f}{\partial x_1 \partial x_2}\right| & \cdots & -\left|\dfrac{\partial^2 f}{\partial x_1 \partial x_n}\right| \\
\vdots & & & \vdots \\
-\left|\dfrac{\partial^2 f}{\partial x_n \partial x_1}\right| & \cdots & \cdots & \dfrac{\partial^2 f}{(\partial x_n)^2}
\end{pmatrix} $$    (4.12)

obtained from the Hessian matrix $\nabla^2 f(x^*)$ by replacing the off-diagonal terms
by their negative absolute values. It is shown in [13] that if the matrix $H^*$
is positive definite then the mapping $F$ of (4.11) is a P-contraction within some
open sphere centered at $x^*$ provided the stepsize $\alpha$ in (4.11) is sufficiently small.
Under these circumstances the distributed asynchronous gradient method of this
example is convergent to $x^*$ provided all initial processor estimates are sufficiently
close to $x^*$ and the stepsize $\alpha$ is sufficiently small. The neighborhood of local
convergence will be larger if the matrix (4.12) is positive definite within an
accordingly larger neighborhood of $x^*$. For example if $f$ is positive definite quadratic with
the corresponding matrix (4.12) positive definite a global convergence result can be
shown.
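
Whether the matrix (4.12) is positive definite is easy to test numerically for a given
Hessian. The following sketch builds $H^*$ from a symmetric Hessian and applies a plain
Cholesky test. The two sample Hessians are assumptions chosen for illustration: in the first
the off-diagonal coupling is weak, in the second it is nearly as large as the diagonal. The
names `h_star`, `is_positive_definite`, `WEAK` and `STRONG` are hypothetical.

```python
def h_star(hessian):
    """Keep the diagonal; replace each off-diagonal entry by minus its absolute value."""
    n = len(hessian)
    return [[hessian[i][j] if i == j else -abs(hessian[i][j]) for j in range(n)]
            for i in range(n)]


def is_positive_definite(M):
    """Cholesky test for a symmetric matrix given as a list of rows."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False          # a non-positive pivot rules out definiteness
                L[i][j] = s ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return True


# Assumed sample Hessians (both are themselves positive definite).
WEAK = [[4.0, 1.0, -1.0],
        [1.0, 3.0, 0.5],
        [-1.0, 0.5, 5.0]]
STRONG = [[4.02, 4.0, 4.0],
          [4.0, 4.02, 4.0],
          [4.0, 4.0, 4.02]]

weak_ok = is_positive_definite(h_star(WEAK))      # H* stays positive definite
strong_ok = is_positive_definite(h_star(STRONG))  # H* loses positive definiteness
```

The second case shows that a positive definite Hessian does not by itself make $H^*$ positive
definite, so the condition is a genuine restriction on the coupling between coordinates.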

One condition that guarantees that $H^*$ is positive definite is strict diagonal
dominance ([17], p. 48-51):

$$ \frac{\partial^2 f}{(\partial x_i)^2} > \sum_{j=1, \, j \ne i}^{n} \left| \frac{\partial^2 f}{\partial x_i \partial x_j} \right|, \quad \forall i = 1, \ldots, n, $$

where the derivatives above are evaluated at $x^*$. This type of condition is
typically associated with situations where the coordinates of $x$ are weakly coupled
in the sense that changes in one coordinate have small effects on the first partial
derivatives of $f$ with respect to the other coordinates. This result can be
generalized to the case of weakly coupled systems (as opposed to weakly coupled coordinates).
Assume that $x$ is partitioned as $x = (x_1, \ldots, x_n)$ where now $x_i \in R^{m_i}$ ($m_i$ may be
greater than one but all other assumptions made earlier regarding $f$ are in effect).
Assume that there are $n$ processors and the ith processor asynchronously updates the
subvector $x_i$ according to an approximate form of Newton's method where the second
order submatrices of the Hessian $\nabla^2_{x_i x_j} f$, $i \ne j$, are neglected, i.e.

$$ x_i \leftarrow x_i - \left(\nabla^2_{x_i x_i} f\right)^{-1} \nabla_{x_i} f. $$    (4.13)

In calculating the partial derivatives above processor $i$ uses the values $x_j$
latest communicated from the other processors $j \ne i$ similarly as in the distributed
gradient method. It can be shown that if the cross-Hessians $\nabla^2_{x_i x_j} f$, $i \ne j$, have
sufficiently small norm relative to $\nabla^2_{x_i x_i} f$, then the totally asynchronous version
of the approximate Newton method (4.13) converges to $x^*$ if all initial processor
estimates are sufficiently close to $x^*$. The same type of result may also be shown
if (4.13) is replaced by

$$ x_i \leftarrow \arg\min_{x_i \in R^{m_i}} f(x_1, x_2, \ldots, x_n). $$    (4.14)


Unfortunately it is not always true that the matrix (4.12) is positive
definite, and there are problems where the totally asynchronous version of the
distributed gradient method is not guaranteed to converge regardless of how small
the stepsize $\alpha$ is chosen. As an example consider the function $f: R^3 \to R$

$$ f(x_1, x_2, x_3) = (x_1 + x_2 + x_3)^2 + (x_1 + x_2 + x_3 - 3)^2 + \epsilon (x_1^2 + x_2^2 + x_3^2), $$

where $0 < \epsilon \ll 1$. The optimal solution is close to $(\frac{1}{2}, \frac{1}{2}, \frac{1}{2})$ for $\epsilon$ small. The scalar
$\epsilon$ plays no essential role in this example. It is introduced merely for the purpose
of making the Hessian of $f$ positive definite. Assume that all initial processor
estimates are equal to some common value $\bar{x}$, and that processors execute many gradient
iterations with a small stepsize before communicating the current values of
their respective coordinates to other processors. Then (neglecting the terms
that depend on $\epsilon$) the ith processor tries in effect to solve the problem

$$ \min_{x_i} \{(x_i + 2\bar{x})^2 + (x_i + 2\bar{x} - 3)^2\}, $$

thereby obtaining a value close to $\frac{3}{2} - 2\bar{x}$. After the processor estimates of the
respective coordinates are exchanged each processor coordinate will have been
updated approximately according to

$$ \bar{x} \leftarrow \frac{3}{2} - 2\bar{x} $$    (4.15)

and the process will be repeated. Since (4.15) is a divergent iterative process
we see that, regardless of the stepsize chosen and the proximity of the initial
processor estimates to the optimal solution, by choosing the communication delays
sufficiently large the distributed gradient method can be made to diverge when the
matrix $H^*$ of (4.12) is not positive definite.
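
The divergence mechanism can be checked numerically. The sketch below runs the coordinate
gradient iterations on the function $f$ of this example with $\epsilon = 0.01$ and stepsize
0.01 (both values are assumptions), starting from a common value 0.6. The parameter
`local_steps` is the number of gradient iterations each processor performs between exchanges
and plays the role of the communication delay; `run_rounds` and `grad_i` are hypothetical
names.

```python
def grad_i(x, i, eps):
    """Partial derivative of f with respect to coordinate i."""
    s = sum(x)
    return 2.0 * s + 2.0 * (s - 3.0) + 2.0 * eps * x[i]


def run_rounds(local_steps, rounds, alpha=0.01, eps=0.01, x0=0.6):
    """Each processor iterates on its own coordinate, then coordinates are exchanged."""
    x = [x0, x0, x0]
    for _ in range(rounds):
        new_x = []
        for i in range(3):
            view = x[:]                   # values last communicated to processor i
            for _step in range(local_steps):
                view[i] -= alpha * grad_i(view, i, eps)
            new_x.append(view[i])
        x = new_x                         # exchange of the updated coordinates
    return x


frequent = run_rounds(local_steps=1, rounds=200)    # exchange after every iteration
rare = run_rounds(local_steps=500, rounds=20)       # many iterations between exchanges
```

With an exchange after every iteration the scheme reduces to the ordinary gradient method
and settles near $6/12.02 \approx 0.499$ in each coordinate. With 500 local iterations per
exchange each round multiplies the deviation from the optimum by roughly $-2$, as predicted
by (4.15), and the iterates grow without bound.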

5.  Convergence  of  Descent  Processes 

We saw in the last section that the distributed gradient algorithm converges
appropriately when the matrix (4.12) is positive definite. This assumption is not
always satisfied, but convergence can still be shown (for a far wider class of
algorithms) if a few additional conditions are imposed on the frequency of obtaining
measurements and on the magnitude of the delays in equation (2.3). The main idea
behind the results described in this section is that if delays are not too large, and if
certain processors do not obtain measurements and do not update much more frequently than
others, then the effects of asynchronism are relatively small and the algorithm behaves
approximately as a centralized algorithm, similar to the class of centralized
pseudogradient algorithms considered in [40].

Let $X = X_1 \times X_2 \times \cdots \times X_L$ be the feasible set, where $X_\ell$ ($\ell = 1, \ldots, L$) is a Banach
space. If $x = (x_1, \ldots, x_L)$, $x_\ell \in X_\ell$, we refer to $x_\ell$ as the $\ell$-th component of $x$. We endow
$X$ with the sup norm, as in (4.8). Let $f: X \to [0, \infty)$ be a cost function to be minimized.
We assume that $f$ is Frechet differentiable and its derivative is Lipschitz continuous.

Each processor $i$ keeps in his memory an estimate $x^i(t) = (x_1^i(t), \ldots, x_L^i(t)) \in X$
and receives measurements $z_{j,\ell}^i \in X_\ell$, $i \ne j$, with the value of the $\ell$-th component of $x^j$,
evaluated by processor $j$ at some earlier time $\tau_{j,\ell}^i(t) \le t$; that is,
$z_{j,\ell}^i(t) = x_\ell^j(\tau_{j,\ell}^i(t))$. He also receives from the environment exogenous, possibly
stochastic measurements $z_i^i \in X$, which are in a direction of descent with respect
to the cost function $f$, in a sense to be made precise later. We denote by $z_{i,\ell}^i$ the
$\ell$-th component of $z_i^i$.

Whenever processor $i$ receives measurements $z_{j,\ell}^i$, he updates his estimate
vector $x^i$ componentwise, according to:

$$ x_\ell^i(t+1) = \beta_{i,\ell}^i(t) x_\ell^i(t) + \sum_{j \ne i} \beta_{j,\ell}^i(t) z_{j,\ell}^i(t) + \alpha^i(t) z_{i,\ell}^i(t). $$    (5.1)

The coefficients $\beta_{j,\ell}^i(t)$ are nonnegative scalars satisfying

$$ \sum_{j=1}^{n} \beta_{j,\ell}^i(t) = 1, \quad \forall i, \ell, t, $$

and such that: if no measurement $z_{j,\ell}^i$ was received by processor $i$ ($i \ne j$) at time $t$,
then $\beta_{j,\ell}^i(t) = 0$. That is, processor $i$ combines his estimate of the $\ell$-th component
of the solution with the estimates (possibly outdated) of other processors that he
has just received, by forming a convex combination. Also, if no new measurement $z_i^i$
was obtained at time $t$, we should set $z_i^i(t) = 0$ in equation (5.1). The coefficient
$\alpha^i(t)$ is a nonnegative stepsize. It can be either independent of $t$ or it may
depend on the number of times up to $t$ that a new measurement (of any type) was
received at processor $i$.

Equation (5.1), which essentially defines the algorithm, is a linear system
driven by the exogenous measurements $z_i^i(t)$. Therefore, there exist linear operators
$\Phi^{ij}(t|s)$, ($t > s$), such that

$$ x^i(t) = \sum_{j=1}^{n} \Phi^{ij}(t|0) x^j(1) + \sum_{s=1}^{t-1} \sum_{j=1}^{n} \alpha^j(s) \Phi^{ij}(t|s) z_j^j(s). $$

We now impose an assumption which states that if the processors cease obtaining
exogenous measurements from some time on (that is, if they set $z_i^i = 0$), they will
asymptotically agree on a common limit:

Assumption 5.1: For any $i, j, s$, $\lim_{t \to \infty} \Phi^{ij}(t|s)$ exists (with respect to the induced
operator norm) and is the same for all $i$. The common limit is denoted by $\Phi^j(s)$.

Assumption 5.1 is very weak. Roughly speaking it requires that for every
component $\ell \in \{1, \ldots, L\}$ there exists a directed graph $G = (N, A)$, where the set $N$ of nodes
is the set $\{1, \ldots, n\}$ of processors, and such that there exists a path from every processor
to every other processor. Also the coefficients $\beta_{j,\ell}^i(t)$ must be such that "sufficient
combining" takes place and the processors tend to agree.
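
The agreement property required by Assumption 5.1 can be visualized in the simplest case of
scalar estimates, no exogenous measurements and no delays. In the sketch below, which is an
illustrative special case and not the general model, the processors form a directed ring and
each one repeatedly takes a convex combination of its own value and the value of its
predecessor. The function name `agree` and the weight `beta` are hypothetical.

```python
def agree(x0, rounds=200, beta=0.5):
    """Directed ring: processor i combines its value with that of processor i - 1."""
    x = list(x0)
    n = len(x)
    for _ in range(rounds):
        # Index -1 wraps around, so processor 0 listens to processor n - 1.
        x = [beta * x[i] + (1.0 - beta) * x[i - 1] for i in range(n)]
    return x


common = agree([0.0, 1.0, 2.0, 7.0])
```

Here every entry of `common` approaches 2.5, the average of the initial values, because the
combining weights of this particular ring form a doubly stochastic matrix. In general the
common limit is some weighted combination of the initial estimates, with weights given by
the limiting operators of Assumption 5.1.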

We can now define a vector $y(t) \in X$ by

$$ y(t) = \sum_{j=1}^{n} \Phi^j(0) x^j(1) + \sum_{s=1}^{t-1} \sum_{j=1}^{n} \alpha^j(s) \Phi^j(s) z_j^j(s) $$

and observe that $y(t)$ is recursively generated by

$$ y(t+1) = y(t) + \sum_{i=1}^{n} \alpha^i(t) \Phi^i(t) z_i^i(t). $$    (5.2)

We can now explain the main idea behind the results to be described:
if $\Phi^{ij}(t|s)$ converges to $\Phi^j(s)$ fast enough, if $\alpha^i(t)$ is small enough, and if
$z_i^i(t)$ is not too large, then $x^i(t)$, for each $i$, will evolve approximately as $y(t)$.
We may then study the behavior of the recursion (5.2) and make inferences about
the behavior of $x^i(t)$.

The above framework covers both specialized processes, in which case we have
$L = n$, as well as the case of total overlap where we have $L = 1$ and we do not distinguish
between components of the estimates. For specialized processes (e.g. example 3) it
is easy to see that $y(t) = (x_1^1(t), x_2^2(t), \ldots, x_n^n(t))$.

We now proceed to present some general convergence results. We allow the
exogenous measurements $z_i^i$ of each processor, as well as the initialization $x^i(1)$
of the algorithm to be random (with finite variance). We assume that they are all
defined on a probability space $(\Omega, \mathcal{F}, P)$ and we denote by $\mathcal{F}_t$ the σ-algebra generated
by $\{x^i(1), z_i^i(s); \ i = 1, \ldots, n; \ s = 1, \ldots, t-1\}$. We assume, however, that the sequence of
times at which measurements are obtained, computations are performed, the times
$\tau_{j,\ell}^i(t)$, as well as the combining coefficients $\beta_{j,\ell}^i(t)$ are deterministic. (In fact,
this assumption may often be relaxed.) In order to quantify the speed of convergence
of $\Phi^{ij}(t|s)$ we introduce

$$ c(t|s) = \max_{i,j} \|\Phi^{ij}(t|s) - \Phi^j(s)\|. $$

By Assumption 5.1, $\lim_{t \to \infty} c(t|s) = 0$ and it may be shown that $c(t|s) \le 1$, $\forall t, s$. Consider
the following assumptions:

Assumption 5.2:

$$ E\left[\left\langle \frac{\partial f}{\partial x}(x^i(t)), \Phi^i(t) z_i^i(t) \right\rangle \,\Big|\, \mathcal{F}_t\right] \le 0, \quad \forall t, i, \ \text{a.s.} $$

Assumption 5.3:

a) For some $K_0 \ge 0$,

$$ E\left[\|z_i^i(t)\|^2\right] \le -K_0 E\left[\left\langle \frac{\partial f}{\partial x}(x^i(t)), \Phi^i(t) z_i^i(t) \right\rangle\right], \quad \forall i, t. $$

b) For some $B > 0$, $d \in [0,1)$, $c(t|s) \le B d^{t-s}$, $\forall t > s$, $\forall s$.

Assumption 5.2 states that $\Phi^i(t) z_i^i(t)$ (which is the "effective update direction"
of processor $i$, see (5.2)) is a descent direction with respect to $f$. Assumption
5.3a requires that $z_i^i(t)$ is not too large. In particular any noise present in
$z_i^i(t)$ can only be "multiplicative-like": its variance must decrease to zero as a
stationary point of $f$ is approached. For example, we may have

$$ z_i^i(t) = -\left[\frac{\partial f}{\partial x_i}(x^i(t))\right] (1 + w^i(t)), $$

where $w^i(t)$ is scalar white noise. Finally, Assumption 5.3b requires that the
processors tend to agree exponentially fast. Effectively, this requires that the
time between consecutive measurements of the type $z_{j,\ell}^i$, $i \ne j$, as well as the delays
$t - \tau_{j,\ell}^i(t)$ are bounded together with some minor restriction on the coefficients $\beta_{j,\ell}^i(t)$
for those times that a measurement of type $z_{j,\ell}^i$ is obtained.
Letting

$$ \alpha_0 = \sup_{t,i} \alpha^i(t), $$

we may use Assumptions 5.3a and 5.3b to show that $\|x^i(t) - y(t)\|$ is of the order
of $\alpha_0$. Using the Lipschitz continuity of $\frac{\partial f}{\partial x}$ it follows that
$\|\frac{\partial f}{\partial x}(y(t)) - \frac{\partial f}{\partial x}(x^i(t))\|$
is also of the order of $\alpha_0$; then, using Assumption 5.2, it follows that (5.2)
corresponds to a descent algorithm, up to first order in $\alpha_0$. Choosing $\alpha_0$ small enough,
convergence may be shown by invoking the supermartingale convergence theorem. The
above argument can be made rigorous and yields the following proposition (the proof
may be found in [35] and in a forthcoming publication):

Proposition 5.1: If Assumptions 5.1, 5.2, 5.3 hold and if $\alpha_0$ is small enough,
then:

a) $f(x^i(t))$, $i = 1, \ldots, n$, as well as $f(y(t))$ converge, almost surely, and to the
same limit.

b) $\lim_{t \to \infty} (x^i(t) - y(t)) = 0$, $\forall i$, almost surely and in the mean square.

c) $$ \sum_{t=1}^{\infty} \sum_{i=1}^{n} \alpha^i(t) E\left[\left\langle \frac{\partial f}{\partial x}(x^i(t)), \Phi^i(t) z_i^i(t) \right\rangle \,\Big|\, \mathcal{F}_t\right] > -\infty, $$    (5.3)

almost surely. The expectation of the above expression is also finite.

A related class of algorithms arises if the noise in $z_i^i(t)$ is allowed to be
additive, e.g.

$$ z_i^i(t) = -\frac{\partial f}{\partial x}(x^i(t)) + w^i(t), $$

where $w^i(t)$ is zero-mean white. In such a case, an algorithm may be convergent
only if $\lim_{t \to \infty} \alpha^i(t) = 0$. In fact, $\alpha^i(t) = 1/t_i$, where $t_i$ is the number of times up to
time $t$ that a new measurement was received at $i$, is the most convenient choice, and
this is what we assume here. However, this choice of stepsize implies that the
algorithm becomes progressively slower, as $t \to \infty$. We may therefore allow the agreement
process to become progressively slower as well, and still retain convergence. In
physical terms, the time between consecutive measurements $z_{j,\ell}^i$ ($i \ne j$) may increase
to infinity, as $t \to \infty$. In mathematical terms:

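Assumption 5.4: a) For some $K_0, K_1, K_2 \ge 0$,

$$ E\left[\|z_i^i(t)\|^2\right] \le -K_0 E\left[\left\langle \frac{\partial f}{\partial x}(x^i(t)), \Phi^i(t) z_i^i(t) \right\rangle\right] + K_1 E\left[f(x^i(t))\right] + K_2. $$

b) For some $B > 0$, $\delta \in (0,1]$, $d \in [0,1)$,

$$ c(t|s) \le B d^{\,t^\delta - s^\delta}, \quad \forall t > s, \ \forall s. $$

We then have [35]: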

Proposition 5.2: Let $\alpha^i(t) = 1/t_i$, where $t_i$ is the number of times up to time $t$ that a new
measurement was received at $i$, and assume that for some $\epsilon > 0$, $t_i \ge \epsilon t$ for all $i, t$. Assume
also that Assumptions 5.1, 5.2, 5.4 hold. Then the conclusions (a), (b), (c) of
Proposition 5.1 remain valid.

Propositions 5.1, 5.2 do not yet prove convergence to the optimum (suppose, for
example, that $z_i^i(t) = 0$, $\forall i, t$). However, (5.3) may be exploited to yield optimality
under a few additional assumptions:
Corollary: Let the assumptions of either Proposition 5.1 or 5.2 hold. Let $T^i$ be
the set of times that processor $i$ obtains a measurement of type $z_i^i$. Suppose that
there exists some $B > 0$ and, for each $i$, a sequence $\{t_k^i\}$ of distinct elements of $T^i$
such that

$$ \max_{i,j} |t_k^i - t_k^j| \le B, \quad \forall k, $$    (5.4)

$$ \sum_{k=1}^{\infty} \min_i \alpha^i(t_k^i) = \infty. $$

Finally, assume that there exist uniformly continuous functions $g^i: X \to [0, \infty)$
satisfying

a) $\liminf_{\|x\| \to \infty} \sum_{i=1}^{n} g^i(x) > 0$,

b) $-E\left[\left\langle \frac{\partial f}{\partial x}(x^i(t)), \Phi^i(t) z_i^i(t) \right\rangle \,\Big|\, \mathcal{F}_t\right] \ge g^i(x^i(t))$, $\forall t \in T^i$, $\forall i$, almost surely,

c) $\sum_{i=1}^{n} g^i(x^*) = 0 \ \Rightarrow \ x^* \in X^* = \{x^* \in X \mid f(x^*) = \inf_x f(x)\}$.

Then, $\lim_{t \to \infty} f(x^i(t)) = \inf_x f(x)$, $\forall i$, almost surely.


Example 3 (continued): It follows from the above results that the distributed
deterministic gradient algorithm applied to a convex function converges provided that
a) the stepsize $\alpha$ is small enough, b) Assumption 5.3(b) holds and c) the processors
update, using (3.7), regularly enough, i.e. condition (5.4) is satisfied. Similarly,
convergence for the distributed stochastic gradient algorithm follows if we choose
a stepsize $\alpha^i(t) = 1/t_i$, and if Assumption 5.4 and condition (5.4) hold.
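
The role of bounded delays can be seen on the same function that caused divergence in
Section 4, namely $f(x) = (x_1+x_2+x_3)^2 + (x_1+x_2+x_3-3)^2 + \epsilon(x_1^2+x_2^2+x_3^2)$. The
sketch below assumes a specialized process in which every processor updates its own
coordinate at every time step, using values of the other coordinates that are exactly
`delay` steps old. The values $\epsilon = 0.01$, stepsize 0.01, delay 5 and the common starting
value 0.6 are illustrative assumptions, and `delayed_gradient` is a hypothetical name.

```python
def delayed_gradient(delay=5, steps=1000, alpha=0.01, eps=0.01, x0=0.6):
    """Coordinate gradient steps with a fixed, bounded communication delay."""
    n = 3
    history = [[x0] * n]                   # history[t][j]: coordinate j at time t
    for t in range(steps):
        now = history[t]
        old = history[max(0, t - delay)]   # outdated values of the other coordinates
        nxt = []
        for i in range(n):
            s = now[i] + sum(old[j] for j in range(n) if j != i)
            g = 2.0 * s + 2.0 * (s - 3.0) + 2.0 * eps * now[i]
            nxt.append(now[i] - alpha * g)
        history.append(nxt)
    return history[-1]


x_delayed = delayed_gradient()
```

With these settings the iterates settle near $6/12.02 \approx 0.499$ in each coordinate even
though the matrix (4.12) is not positive definite for this function. The difference from the
divergent behavior of Section 4 is that the delay is kept small relative to the stepsize
instead of being allowed to grow arbitrarily.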


Example 4 (continued): As in the previous example, convergence to
stationary points of $f$ may be shown, provided that the stepsize is not too large, that the
communication delays are not too large and that the processors do not update too
irregularly. It should be pointed out that a more refined set of sufficient
conditions for convergence may be obtained, which links the "coupling constants"
of the problem with bounds on the delays [35]. These conditions effectively
quantify the notion that the time between consecutive communications and
communication delays between decision makers should be inversely proportional to the
strength of coupling between their respective divisions.


Example  7:  (continued)  Several  common  algorithms  for  identification  of  a  moving 
average  process  satisfy  the  conditional  descent  Assumption  5.2.  (e.g.  the  Least 

Mean  Squares  algorithm,  or  its  normalized  version-NLMS) .  Consequently,  Proposition  5.2 
may  be  invoked.  Using  part  (c)  of  the  Proposition,  assuming  that  the  input  is 
sufficiently  rich  and  that  enough  messages  are  exchanged,  it  follows  that  the  dis¬ 
tributed  algorithm  will  correctly  identify  the  system.  A  detailed  analysis  is  given 
in  [35]. 

A similar approach may be taken to analyze distributed stochastic algorithms
in which the noises are correlated and Assumption 5.2 fails to hold. Very few global
convergence results are available even for centralized algorithms of this kind
[34,36], and it is an open question whether some distributed versions of them also
converge. However, as in the centralized case, one may associate an ordinary
differential equation with such an algorithm, as in [37,38], and prove local
convergence subject to an assumption that the algorithm returns infinitely often to
a bounded region (see [35]). Such results may be used, for example, to demonstrate
local convergence of a distributed extended least squares (ELS) algorithm applied to
the ARMAX identification problem in Example 7.

6.  Convergence  of  Distributed  Processes  with  Bayesian  Updates

In Sections 4 and 5 we considered distributed processes in which a solution is
being successively approximated, while the structure of the updates is restricted
to be of a special type. In this section we take a different approach and assume
that the estimate computed by any processor at any given time minimizes the
conditional expectation of a cost function, given the information available to him
at that time. Moreover, all processors "know" the structure of the cost function
and the underlying statistics, and their performance is limited only by the
availability of posterior information. Whenever a processor receives a
measurement $z^i$ (possibly containing an earlier estimate of another processor) his
information changes and a new estimate may be computed.

Formally, let $X = R^m$ be the feasible set, $(\Omega,\mathcal{F},P)$ a probability
space, and $f: X \times \Omega \to [0,\infty)$ a random cost function which is strongly
convex in x for each $\omega \in \Omega$. Let $I^i(t)$ denote the information of
processor i at time t, which generates a $\sigma$-algebra
$\mathcal{F}^i_t \subset \mathcal{F}$. At any time that the information of processor i
changes, he updates his estimate according to

$$x^i(t+1) = \arg\min_{x \in X} E[f(x,\omega) \mid \mathcal{F}^i_t]. \tag{6.1}$$

Assuming that f is jointly measurable, this defines an almost surely unique,
$\mathcal{F}^i_t$-measurable random variable [39].

The information $I^i(t)$ of processor i may change in one of the following ways:
a) New exogenous measurements $z^i(t)$ are obtained, so that
$I^i(t) = (I^i(t-1), z^i(t))$.
b) Measurements $z^i_j(t)$ with the value of an earlier estimate of processor j are
obtained; that is,

$$z^i_j(t) = x^j(\tau^{ij}(t)), \quad \tau^{ij}(t) < t, \qquad I^i(t) = (I^i(t-1), z^i_j(t)). \tag{6.2}$$

c) Some information in $I^i(t-1)$ may be "forgotten"; that is,
$I^i(t) \subset I^i(t-1)$ (or $\mathcal{F}^i_t \subset \mathcal{F}^i_{t-1}$).

The times at which measurements are obtained, as well as the delays, are either
deterministic or random; if they are random, their statistics are described by
$(\Omega,\mathcal{F},P)$ and these statistics are known by all processors.

Case 1: Increasing Information. We start by assuming that information is never
forgotten, i.e., $\mathcal{F}^i_{t+1} \supset \mathcal{F}^i_t$, $\forall i,t$. Let
$f(x,\omega) = \|x - x^*(\omega)\|^2$, where $x^*: \Omega \to R^m$ is an unknown random
vector to be estimated. Then,

$$x^i(t+1) = E[x^*(\omega) \mid \mathcal{F}^i_t]$$

and by the martingale convergence theorem, $x^i(t)$ converges almost surely to a
random variable $y^i$. Moreover, it has been shown that if "enough" measurements of
type (6.2) are obtained by each processor, then $y^i = y^j$, $\forall i,j$, almost
surely [30,41]. If f is not quadratic but strongly convex, the same results are
obtained, except that convergence holds in probability and in the
$L^2(\Omega,\mathcal{F},\mu)$ sense, where $\mu$ is a measure equivalent to P,
determined by the function f [39]. However, this scheme is not, strictly speaking,
iterative, since $I^i(t)$ increases and unbounded memory is required.
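
The reduction of (6.1) to a conditional mean under the quadratic cost of Case 1
follows from a standard orthogonality argument. The short derivation below is a
sketch that assumes $x^*$ has finite second moments; the symbol $\bar{x}$ is
introduced here only for illustration.

```latex
% Let \bar{x} = E[x^* \mid \mathcal{F}^i_t]. For any \mathcal{F}^i_t-measurable x,
\begin{align*}
E\big[\|x - x^*\|^2 \mid \mathcal{F}^i_t\big]
  &= \|x - \bar{x}\|^2
   + 2\,(x - \bar{x})^T E\big[\bar{x} - x^* \mid \mathcal{F}^i_t\big]
   + E\big[\|\bar{x} - x^*\|^2 \mid \mathcal{F}^i_t\big] \\
  &= \|x - \bar{x}\|^2 + E\big[\|\bar{x} - x^*\|^2 \mid \mathcal{F}^i_t\big],
\end{align*}
% because E[\bar{x} - x^* \mid \mathcal{F}^i_t] = 0. The second term does not
% depend on x, so the minimum is attained at x = \bar{x}.
```

Since the $\sigma$-algebras $\mathcal{F}^i_t$ are increasing in t, the sequence of
conditional means is a martingale, which is what the martingale convergence theorem
is applied to.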


Case 2: Iterative schemes

The above scheme can be made iterative if we allow processors to forget their
past information. For example, let

$$I^i(t) = \begin{cases} \{x^i(t), z^i_j(t)\}, & \text{if a measurement } z^i_j(t) \text{ is obtained at time } t, \\ \{x^i(t)\}, & \text{otherwise.} \end{cases}$$

Let $z^i_j(t) = x^j(\tau^{ij}(t))$, $i \ne j$, $\tau^{ij}(t) \le t$. Assuming that
"enough" measurements of this type are obtained by each processor, asymptotic
agreement may still be obtained, as in Case 1 [39]. It has also been shown that
$x^i(t+1) - x^i(t)$ converges to zero for each i, but it is not known whether
$x^i(t)$ is guaranteed to converge.

Even  though  this  case  corresponds  to  an  iterative  algorithm,  it  may  be  very 
hard  to  implement:  The  computation  of  the  minimum  in  (6.1)  may  be  intractable.  Also, 
even  if  the  processors  asymptotically  converge  and  agree,  there  are  no  guarantees 
in  general  about  the  quality  of  the  final  estimate.  There  is  one  notable  exception 
where  these  drawbacks  disappear,  which  we  discuss  below: 


Case 3: Distributed Linear Estimation

Let $f(x,\omega) = \|x - x^*(\omega)\|^2$, where $x^*$ is a zero-mean Gaussian scalar
random variable to be estimated. Suppose that at time zero each processor i obtains
measurements

$$z^i_k = x^* + w^i_k, \quad k = 1,\dots,m_i, \tag{6.3}$$

where the $w^i_k$ are zero-mean Gaussian noises. We allow the noises of different
processors to be correlated with each other. Let
$I^i(0) = \{z^i_k \mid k = 1,\dots,m_i\}$. No further measurements of the form (6.3)
are obtained after time zero. Subsequently, each processor i receives from time to
time measurements $z^i_j(t) = x^j(\tau^{ij}(t))$, $\tau^{ij}(t) < t$, of the other
processors' estimates and updates according to

$$x^i(t+1) = E[x^* \mid I^i(0), z^i_j(t)].$$

The timing and delay of these latter measurements are assumed to be deterministic.
If we assume that an infinite number of measurements of each type $z^i_j$ is obtained
by each processor i, together with an additional assumption that essentially
requires that there exists an indirect communication path between every pair of
processors, then it can be shown that $x^i(t)$ converges in the mean square to the
centralized estimate

$$\hat{x} = E[x^* \mid I^1(0),\dots,I^n(0)],$$

which is the optimal estimate of $x^*$ given the total information of all processors [35], [31
What is interesting about the above algorithm is that it corresponds to a
distributed iterative decomposition algorithm for solving the centralized linear
estimation problem. The minimization of the cost criterion over a space of
dimension $\sum_{i=1}^{n} m_i$, in general, is replaced by a sequence of
minimizations along $(m_i+1)$-dimensional subspaces.

If the noises $w^i$, $w^j$, $i \ne j$, are independent, the algorithm converges
after finitely many iterations. In general, the algorithm converges linearly, but
the rate of convergence depends strongly on the number of processors and on the
angles between certain subspaces of random variables (essentially on the
correlations between $w^i$ and $w^j$, $i \ne j$; see [35], [39]).
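
The scheme of Case 3 can be illustrated numerically. The sketch below is a
simplified instance under stated assumptions, not the general algorithm of [35]:
two processors with two measurements each, exchanging estimates in a fixed
round-robin order with no delay. Because all variables are zero-mean and jointly
Gaussian, each conditional expectation is a linear least-squares projection, so an
estimate is stored as a coefficient vector over the stacked measurements. The names
`project` and `simulate`, the covariance values, and the exchange order are
hypothetical choices made for illustration.

```python
import numpy as np

def project(C, c, basis):
    """Coefficient vector (over the stacked measurements z) of E[x* | basis].

    C = Cov(z) and c = Cov(z, x*). Each element b of `basis` is a coefficient
    vector representing the zero-mean Gaussian variable b @ z.
    """
    B = np.column_stack(basis)
    gram = B.T @ C @ B                      # covariance of the conditioning variables
    beta = np.linalg.lstsq(gram, B.T @ c, rcond=None)[0]
    return B @ beta

def simulate(C, c, blocks, rounds):
    """Processor i conditions on I^i(0) and the latest estimate of processor i-1."""
    n, M = len(blocks), len(c)
    own = [[np.eye(M)[:, k] for k in blk] for blk in blocks]
    a = [project(C, c, own[i]) for i in range(n)]   # estimates based on I^i(0) alone
    a_central = np.linalg.solve(C, c)               # centralized estimate
    gaps = []
    for _ in range(rounds):
        for i in range(n):
            j = (i - 1) % n
            # Projection onto an (m_i + 1)-dimensional subspace
            a[i] = project(C, c, own[i] + [a[j]])
            d = a[i] - a_central
            gaps.append(float(d @ C @ d))           # E[(x^i - centralized estimate)^2]
    return a, a_central, gaps

var_x = 1.0                                         # assumed variance of x*
R = np.array([[1.0, 0.2, 0.3, 0.1],                 # assumed noise covariance, with
              [0.2, 2.0, 0.0, 0.3],                 # correlation across processors
              [0.3, 0.0, 0.8, 0.2],
              [0.1, 0.3, 0.2, 1.5]])
C = var_x * np.ones((4, 4)) + R                     # Cov(z) for z = x* + w
c = var_x * np.ones(4)                              # Cov(z, x*)
a, a_central, gaps = simulate(C, c, blocks=[[0, 1], [2, 3]], rounds=30)
print(gaps[0], gaps[-1])
```

The list `gaps` records the mean-square distance between each updated estimate and
the centralized estimate. Under these assumptions it is non-increasing from one
update to the next, since the received estimate always lies in the subspace onto
which the receiver projects.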